The fuel price behaves a bit like stock prices: it changes heavily during a day, though mostly not with the same amplitude. Even on a day-by-day basis there are many ups and downs. Therefore I want to have a look at the data and see what's inside. Furthermore, I want to build a forecasting model for the price on a daily basis. I chose this as my capstone project to become familiar with time series data and with the ARIMA forecasting model. The price history of all fuel stations in Germany is available from Tankerkoenig.de at https://dev.azure.com/tankerkoenig/_git/tankerkoenig-data for private use under the following license: https://creativecommons.org/licenses/by-nc-sa/4.0/ .
I'll follow the CRISP-DM process:
All necessary information about libraries and so on is in the Readme.md of the GitHub repository (https://github.com/klogges5/capstone).
Therefore I'll have a look at the historical data (2015-2019). To handle the amount of data I first reduced it to zip codes starting with 40, which leads to a 250 MB SQL file. Because GitHub only stores files up to 100 MB, I reduced it further to 8 stations. First I'll explore the data and use some plot functions to find insights. For the forecast model I'll use SARIMAX from statsmodels (Seabold, 2010) and build the model based on ACF/PACF. Even though it is not necessary in the capstone project to use two models, I'll also have a look at an LSTM as a forecast model.
import datetime
import pandas as pd
import glob
import numpy as np
from sqlalchemy import create_engine
from helper_functions import save_data, load_data, get_data4uid
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.preprocessing import normalize, MinMaxScaler
from sklearn.metrics import mean_squared_error, r2_score
from keras.preprocessing.sequence import TimeseriesGenerator
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout, LSTM
import statsmodels.api as sm
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import kpss
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from pylab import rcParams
rcParams['figure.figsize'] = (15, 8)
# use prefix for Mac 'Volumes/JMVB' or Windows 'e:/Working'
prefix='/Volumes/JMVB'
First load the file stations.csv to see what is inside. For this I loaded a stations file from the last month. Here is a short description of the columns:
uuid,name,brand,street,house_number,post_code,city,latitude,longitude, first_active, openingtimes_json
# change the directory as necessary
stations_pd = pd.read_csv(prefix+'/sprit/stations.csv')
stations_pd.head()
stations_pd.shape
stations_pd.info()
# want to see all stations around zip-code 40xxx
short_pd = stations_pd.dropna(subset=['post_code'])
short_pd = short_pd[short_pd['post_code'].str.match(pat=r'40\d{3}')]
The prices are organized in folders for every year and month and are really big data (around 5 GB per year). Therefore I've built a function to load the data per year and extract only the rows for the stations in short_pd. The pricesxxx.csv files have the following format:
date,station_uuid,diesel,e5,e10,dieselchange,e5change,e10change
# let's read one file to see special cases or wrong data
prices_tmp = pd.read_csv(prefix + '/sprit/2015/01/2015-01-01-prices.csv', usecols=['date', 'station_uuid', 'diesel', 'e5', 'e10'])
prices_tmp.head()
prices_tmp.describe()
prices_tmp[~(prices_tmp.diesel < 0.5)].describe()
A minimum price of -0.001 is not a real value, so I will replace it with np.nan. Furthermore, is 2.015 € really the maximum for all fuel types? Let's have a look at this as well...
prices_tmp[prices_tmp.diesel > 1.70].head()
prices_tmp[prices_tmp['station_uuid'] == '42d6d4cc-6909-45fc-f60f-a1abd9ea1f0e']
It seems like this station played a joke for New Year ;-) This dataset will be removed later, after all prices are read, because I will focus on a few stations only.
# let's read another file to see special cases or wrong data
prices_tmp = pd.read_csv(prefix + '/sprit/2016/01/2016-01-01-prices.csv', usecols=['date', 'station_uuid', 'diesel', 'e5', 'e10'])
prices_tmp.describe()
There is also a minimum of -0.01 € and a questionable maximum of 2.028 € for diesel. The maximum for e5 is only 1.619 €, and in Germany diesel is normally cheaper than e5 or e10.
I will clean all prices and drop everything below 0.5 €, because the price was never that low in the past.
prices_tmp[(prices_tmp.diesel <= 1.70) & (prices_tmp.diesel > 0.5)].describe()
In the examples above I've found some wrong data in these two files. After reading the whole dataset I'll look at the statistics to see what can be dropped.
def read_data_year(year, filepath):
    '''
    Read all relevant data for one year.
    Input: year, filepath to the price folders
    Output: dataframe with all prices for region 40x and that year
    '''
    print('Read year {}'.format(year))
    files = glob.glob('{}{}/*/*'.format(filepath, year))
    prices = [pd.read_csv(file, usecols=['date', 'station_uuid', 'diesel', 'e5', 'e10']) for file in files]
    prices_pd = pd.concat(prices, sort=False)
    data_pd = prices_pd[prices_pd['station_uuid'].isin(short_pd['uuid'])][['date', 'station_uuid', 'diesel', 'e5', 'e10']]
    return data_pd
def read_data(filepath='e:/Working/sprit/'):
    '''
    Read all data and make some conversions.
    Input: filepath to input files, default is given
    Output: dataframe with datetime column and station_uuid as category
    '''
    years = ['2015', '2016', '2017', '2018', '2019']
    prices = pd.concat([read_data_year(year, filepath) for year in years])
    # replace wrong data (< 0.5, i.e. too small) with nan
    prices['diesel'] = prices['diesel'].apply(lambda x: np.nan if x < 0.5 else x)
    prices['e5'] = prices['e5'].apply(lambda x: np.nan if x < 0.5 else x)
    prices['e10'] = prices['e10'].apply(lambda x: np.nan if x < 0.5 else x)
    # make uuid a category
    prices.station_uuid = prices.station_uuid.astype('category')
    # date should be datetime
    prices.date = pd.to_datetime(prices.date, utc=True)
    return prices
# Reading of all relevant data
prices_pd = read_data(prefix + '/sprit/')
prices_pd.info()
prices_pd.describe()
The max values of the fuel types differ too much from the 75% percentile; these values are probably not valid. Let's see which values we can set to NaN. From looking into the data I found the following thresholds useful.
prices_pd[prices_pd.diesel > 1.7]
prices_pd[prices_pd.e10 > 1.75]
prices_pd[prices_pd.e5 > 1.82]
prices_pd['diesel'] = prices_pd['diesel'].apply(lambda x: np.nan if x > 1.7 else x)
prices_pd['e10'] = prices_pd['e10'].apply(lambda x: np.nan if x > 1.75 else x)
prices_pd['e5'] = prices_pd['e5'].apply(lambda x: np.nan if x > 1.82 else x)
prices_pd.isna().sum()
#drop all rows where all fuel values are NaN
prices_pd.dropna(axis=0, how='all', subset=['diesel', 'e5', 'e10'], inplace=True)
prices_pd.isna().sum()
prices_pd.describe()
Now the data looks homogeneous and we can save it to the SQLite database. The NaNs will be handled later; because I'll use only daily means for my forecasting model, most of them will go away.
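The save_data and load_data helpers come from helper_functions.py in the repository and are not shown in this notebook. As a rough sketch of what such helpers could look like with pandas and SQLAlchemy (the real implementation may differ; the table names 'Prices' and 'Stations' are assumed from how the helpers are called below):

```python
import pandas as pd
from sqlalchemy import create_engine

def save_data(df, table_name, db_file):
    '''Write a dataframe to a table in a SQLite file, replacing any existing table.'''
    engine = create_engine('sqlite:///{}'.format(db_file))
    df.to_sql(table_name, engine, if_exists='replace', index=False)

def load_data(db_file):
    '''Read the Prices and Stations tables back from the SQLite file.'''
    engine = create_engine('sqlite:///{}'.format(db_file))
    prices = pd.read_sql_table('Prices', engine, parse_dates=['date'])
    stations = pd.read_sql_table('Stations', engine)
    return prices, stations
```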
# to reduce the data for uploading to GitHub, I'll choose only 8 stations
uuids = ['005056ba-7cb6-1ed2-bceb-82ea369c0d2d', '79fb1f24-bebb-489e-841f-728f9053b555', '51d4b59c-a095-1aa0-e100-80009459e03a', '005056ba-7cb6-1ed2-bceb-a46e32000d3e',
'e43c09b0-5baa-4738-962a-c94388e93c30', '82119d5d-775f-42af-ac56-000d7c91413f', 'e7807347-796f-4aac-997d-07d0c988e109', '5bf85d09-ea6b-4146-b23f-4b902e2e1554']
prices_pd = prices_pd[prices_pd['station_uuid'].isin(uuids)][['date', 'station_uuid', 'diesel', 'e5', 'e10']]
# uncomment and run this cell only if you read new csv files, it will overwrite the sql file
#save_data(prices_pd, 'Prices', './Data/prices_40.sql')
#save_data(stations_pd, 'Stations', './Data/prices_40.sql')
If you didn't load the raw data from Tankerkoenig, you can start here by loading the data from the SQLite database. Here I will have a look at some interesting points in the data.
# run this cell to load the data from the sql file
prices_pd, stations_pd = load_data('./Data/prices_40.sql')
prices_pd.info()
# make uuid a category; this is not stored in the database
prices_pd.station_uuid = prices_pd.station_uuid.astype('category')
prices_pd.info()
prices_pd.isnull().values.any()
So we now have all prices and stations for a closer look at how the prices rise and fall. As we see from the cell above, we have NaN values inside our data. I will leave them here and drop the NaNs later, after filtering is done.
# what does the data look like for one station?
prices_pd[prices_pd['station_uuid'] == '005056ba-7cb6-1ed2-bceb-82ea369c0d2d']
We have many data points per day; as mentioned earlier, I will only build a forecast model on a daily basis. So we will calculate and use the mean of each day.
Now use the predefined function from helper_functions.py to filter on one station and one fuel type. From here on I will focus on one station and one fuel.
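The get_data4uid function also lives in helper_functions.py, so its implementation is not part of this notebook. A plausible sketch, assuming it returns a single fuel column for one station, indexed and sorted by date (the real helper may differ):

```python
import pandas as pd

def get_data4uid(prices, uid, typ='diesel'):
    '''Return a one-column dataframe with the given fuel type for one station,
    indexed and sorted by the datetime column "date".'''
    data = prices[prices['station_uuid'] == uid][['date', typ]].copy()
    return data.set_index('date').sort_index()
```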
# get data for one station and type "diesel"
star_bahn = get_data4uid(prices_pd, '005056ba-7cb6-1ed2-bceb-82ea369c0d2d', 'diesel')
star_bahn.info()
star_bahn.head(10)
#let's have a look at the data
star_bahn.plot(figsize=(15,8));
In the cells above you could see that the timestamps vary, which means the stations transmit new prices directly after changing them. I want to forecast on a daily basis, therefore I'll resample the data to days and take the mean. I will also use ffill to overcome missing values.
# get samples per day
star_day = star_bahn.resample('d').mean().ffill()
star_day.shape
star_day.plot(figsize=(15,8));
star_day.isna().sum()
So we have no np.nan anymore in our dataset, but we have a few days in 2018 without data. Possibly the station was closed during this time. I don't think this should be a problem.
# get another station
total = get_data4uid(prices_pd, '79fb1f24-bebb-489e-841f-728f9053b555', 'diesel')
total_day = total.resample('d').mean().ffill()
total_day.diesel.idxmax()
# plot both stations in one plot to compare them
plt.figure(figsize=(15,8));
plt.plot(star_day, label = 'Star');
plt.plot(total_day, label = 'Total');
plt.title('Diesel price, Star/Total')
plt.ylabel('price')
plt.legend();
The Total diesel price shows one crazy peak somewhere between July/August 2017. Let's have a closer look...
total_day['2017-07':'2017-08'].diesel.idxmax()
total_day['2017-08-12':'2017-08-18'].diesel
total['2017-08-13':'2017-08-17'].diesel
The data seems to be valid; it could be that some records are missing, and that is why the step is so big. On the 14th there are only two values, and on the 15th and 16th of August there is only one value each. It is a little strange, but it is possible. So I will leave this data as is.
# want to see the differences from day to day
star_day_diff = star_day.diff()
total_day_diff = total_day.diff()
plt.figure(figsize=(15,8))
plt.plot(star_day_diff)
plt.plot(total_day_diff)
plt.grid(axis='x')
plt.legend(['Star', 'Total']);
The plot above shows the differences between days, but now let's have a closer look at a shorter time frame and the corresponding prices. The two chosen stations are located near each other, around 500 m apart.
star_intraday = star_bahn['2017-05-01':'2017-05-07']
total_intraday = total['2017-05-01':'2017-05-07']
plt.figure(figsize=(15,8))
plt.plot(star_intraday)
plt.plot(total_intraday)
plt.grid(axis='x')
plt.legend(['Star', 'Total']);
In the plot above you can easily see who follows whom: Total raises the price first and Star follows; when prices go down, Star is first and Total follows. Interesting :-)
What does it look like during a day? When is the price lowest and when is it highest, averaged over all days? I'll write a function to plot this.
def plot_day4uuid(uuid, typ='diesel', name=None, save=None):
    '''
    Plots the distribution of prices over a day, based on all years and on 2019 only.
    Input:
        uuid : id of station
        typ  : which type of gas
        name : title for the plot
        save : filename to save'''
    hourly_ticks = 60 * 60 * np.arange(24)
    data = get_data4uid(prices_pd, uuid, typ)
    # drop wrong data
    data.drop(data[data[typ] < 0.5].index, inplace=True)
    data.dropna(inplace=True)
    # resample to hourly values
    data_h = data.resample('H').mean().dropna()
    # group by hour for all data
    data_byday = data_h.groupby(data_h.index.time).mean()
    # group by hour only for 2019
    data_2019 = data_h['2019-01-01':]
    data_byday_2019 = data_2019.groupby(data_2019.index.time).mean()
    # bug in pandas plot
    # workaround found in https://github.com/facebook/prophet/issues/999
    pd.plotting.register_matplotlib_converters()
    plt.figure(figsize=(15,8))
    plt.plot(data_byday, label='Hourly Mean')
    plt.plot(data_byday_2019, label='Hourly Mean 2019')
    if name:
        plt.title(name)
    else:
        plt.title(uuid)
    plt.legend()
    plt.xticks(hourly_ticks)
    if save:
        plt.savefig(save)
plot_day4uuid('005056ba-7cb6-1ed2-bceb-82ea369c0d2d', 'diesel', 'Star') #, save='./star_hourly_mean.png')
plot_day4uuid('79fb1f24-bebb-489e-841f-728f9053b555', 'diesel', 'Total')
plot_day4uuid('51d4b59c-a095-1aa0-e100-80009459e03a', name='JET Koelner') #, save='./jetkoelner.png')
plot_day4uuid('005056ba-7cb6-1ed2-bceb-a46e32000d3e', name='Star Richrather') #, save='./starrichrather.png')
star_bahn.describe()
It seems from the examples above that on average it is a good idea to get fuel in the evening, depending on your preferred station between 18-19 o'clock (Star) or 16-19 o'clock (Total). The standard deviation is 9 cents, so fueling at the right time can save money.
To answer this question I grouped the data by hour and weekday.
def plot_weekday4uuid(uuid, typ='diesel', name=None, save=None):
    '''
    Plots the distribution of prices per weekday, based on all years and on 2019 only.
    Input:
        uuid : id of station
        typ  : which type of gas
        name : title for the plot
        save : filename to save'''
    hourly_ticks = np.arange(25)
    data = get_data4uid(prices_pd, uuid, typ)
    # drop wrong data
    data.dropna(inplace=True)
    # resample to hourly values
    data = data.resample('H').mean().dropna()
    # group by hour only for 2019
    data_2019 = data['2019-01-01':].copy()
    data['day'] = data.index.weekday
    data['hour'] = data.index.hour
    data_2019['day'] = data_2019.index.weekday
    data_2019['hour'] = data_2019.index.hour
    fig, [ax, ax2] = plt.subplots(2, 1, figsize=(15,12), constrained_layout=True)
    data.groupby(['hour','day']).mean()['diesel'].unstack().plot(ax=ax)
    data_2019.groupby(['hour','day']).mean()['diesel'].unstack().plot(ax=ax2)
    if name:
        ax.set_title(name)
        ax2.set_title(name+' 2019')
    else:
        ax.set_title(uuid)
        ax2.set_title(uuid+' 2019')
    #ax.legend()
    ax.set_xticks(hourly_ticks)
    ax2.set_xticks(hourly_ticks)
    #ax.set_xlim(6,23)
    #ax2.set_xlim(6,23)
    fig.suptitle('Hourly prices grouped on weekdays', fontsize=16)
    if save:
        fig.savefig(save)
plot_weekday4uuid('005056ba-7cb6-1ed2-bceb-82ea369c0d2d', 'diesel', 'Star') #, './hourly_based_weekdays.png')
Here we see a big difference: in the years before 2019 there was an obvious price minimum on Saturday and Sunday at 8 pm. In 2019 this is no longer noticeable on Sundays, but Saturday still seems to be the best time for the cheapest prices.
Now let's see some statistics from all stations in this area and make a nice plot of it with plotly.
prices_pd
# drop rows without a diesel price (calling dropna on the column alone would not change the dataframe)
prices_pd = prices_pd.dropna(subset=['diesel'])
# group on station_uuid and have a look at "diesel"
grouped = prices_pd.groupby('station_uuid')[['diesel']]
grouped.describe()
# calculate some statistical values
var_price = grouped.var()
min_price = grouped.min()
std_price = grouped.std()
var_price.rename(inplace=True, columns={"diesel":"var"})
min_price.rename(inplace=True, columns={"diesel":"min"})
std_price.rename(inplace=True, columns={"diesel":"std"})
short_pd = stations_pd.copy()
short_pd.set_index(['uuid'], inplace=True)
short_pd = short_pd.join(var_price)
short_pd = short_pd.join(min_price)
short_pd = short_pd.join(std_price)
short_pd.dropna(inplace=True, subset=['var', 'min', 'std'])
short_pd.info()
fig = px.scatter_mapbox(short_pd, lat="latitude", lon="longitude", hover_name="name", hover_data=["city", "post_code"],
color_discrete_sequence=["fuchsia"], zoom=10, width=700, height=700, size="var", color="min", title="Variance(size) and Min price")
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
# Load data if you start here
prices_pd, stations_pd = load_data('./Data/prices_40.sql')
All libraries and data were loaded before; here I want to build a forecast model for the fuel prices. First we'll have a closer look at the data with the statsmodels package.
star_bahn = get_data4uid(prices_pd, '005056ba-7cb6-1ed2-bceb-82ea369c0d2d', 'diesel')
star_bahn.info()
# I want to use daily values for the ongoing analysis
# use ffill to have a value for each day, otherwise the ARIMA model has problems
star_bahn.dropna(inplace=True)
star_day = star_bahn.resample('d').mean().ffill()
star_day = star_day.asfreq('d')
star_day.dropna(inplace=True)
# let's see what is in the data; let's assume a weekly seasonal component
decomposition = sm.tsa.seasonal_decompose(star_day, period=52, model='additive')
decomposition.plot();
# now let's see if there is a yearly seasonal aspect inside
decomposition = sm.tsa.seasonal_decompose(star_day, period=365, model='additive')
decomposition.plot();
In the plots above you can easily see there is a trend and a seasonal aspect in the data. In the first plot you see the weekly seasonal aspect, and in the second plot you can also see a yearly seasonal aspect in the diesel price. Therefore a standard ARIMA model will have problems with such data, and even a SARIMA model, because the data is not stationary.
Stationarity can be forced by differencing; let's see...
decomposition = sm.tsa.seasonal_decompose(star_day.diff().dropna(), period=52, model='additive')
decomposition.plot();
decomposition = sm.tsa.seasonal_decompose(star_day.diff().dropna().diff().dropna(), period=52, model='additive')
decomposition.plot();
Even in the first difference there is still a small trend; it is removed when we difference a second time.
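As a tiny numeric illustration of why differencing removes trend (on a made-up toy series, not the fuel data): one diff turns a linear trend into a constant, and the accelerating growth here needs two diffs.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0, 7.0, 11.0])   # toy series with accelerating growth
print(s.diff().dropna().tolist())            # first difference: still trending upward
print(s.diff().diff().dropna().tolist())     # second difference: constant, trend removed
```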
Let's see if it is also seen in ACF and PACF.
fig,ax = plt.subplots(2,1,figsize=(20,10))
fig = sm.graphics.tsa.plot_acf(star_day.dropna(), lags=20, ax=ax[0])
fig = sm.graphics.tsa.plot_pacf(star_day.dropna(), lags=20, ax=ax[1])
plt.show()
So from the ACF it is also clear that the raw data is not stationary. Let's see the ACF and PACF for the first difference.
fig,ax = plt.subplots(2,1,figsize=(20,10))
fig = sm.graphics.tsa.plot_acf(star_day.diff().dropna(), lags=20, ax=ax[0])
fig = sm.graphics.tsa.plot_pacf(star_day.diff().dropna(), lags=20, ax=ax[1])
plt.show()
The original data is not stationary, as seen in the autocorrelation part, but the first difference looks good (i.e. stationary). Let's see what the ADF and KPSS tests tell us about the raw data and the first difference.
# found this summary of the ADF and KPSS test values useful, therefore copied and adapted it here
def stationarity_test(ts):
    print('Results of ADF Test:')
    df = sm.tsa.adfuller(ts, autolag='AIC')
    df_out = pd.Series(df[0:3], index=['Test Statistic', 'p-value', 'Number of lags'])
    for key, value in df[4].items():
        df_out['Critical Value (%s)' % key] = value
    print(df_out)
    print('\nResults of KPSS Test:')
    kpss_res = kpss(ts, regression='c')
    kpss_out = pd.Series(kpss_res[0:3], index=['Test Statistic', 'p-value', 'Number of lags'])
    for key, value in kpss_res[3].items():
        kpss_out['Critical Value (%s)' % key] = value
    print(kpss_out)
# ADF: when the test statistic is greater than the critical value, we fail to reject the null hypothesis (the series is not stationary)
stationarity_test(star_day.diesel.dropna())
stationarity_test(star_day.diesel.diff().dropna())
The ADF test for the first difference is also fine; the p-value is nearly zero, which means the first difference is stationary. So we can use d=1 for the ARIMA model.
The ACF gives a hint for the q value (MA) and the PACF a hint for the p value (AR) of the ARIMA model. In the ACF/PACF plots of the first difference, lags 1, (2) and 7 lie outside the confidence band in the ACF, and lags 1, 2, (3), 7, 8, 9 in the PACF. As a rule of thumb you should start with the smallest number of significant lags, so I'll start with q=1.
First i've to prepare the dataset index, it is a datetime index with localization. I will remove the localization and convert the index to a Datetime Index with period 'D'.
star_day.index
star_day = star_day.tz_localize(None)
star_day.index = pd.DatetimeIndex(star_day.index).to_period('D')
star_day.index
Then I split the data into train and test. For training I'll use the data from 2015-2018 and for testing I'll use 2019.
# split train and test data
train = star_day['2015-01-01':'2018-12-31'].diesel
test = star_day['2019-01-01':].diesel
train.size, test.size
Let's see what the arma_order_select_ic gives as best parameter for (p, q).
order_sel = sm.tsa.arma_order_select_ic(train, max_ar = 4, max_ma = 4, ic=['aic', 'bic'], trend='c')
print('ARIMA: ', order_sel['aic_min_order'], order_sel['bic_min_order'])
def fit_arima(order, season_order):
    '''Fit a SARIMAX model and return the fitted result and predictions'''
    print(order, season_order)
    result = sm.tsa.statespace.SARIMAX(endog=train, order=order, seasonal_order=season_order, enforce_invertibility=False, enforce_stationarity=False).fit()
    print(result.summary())
    pred = result.forecast(steps=354)
    return result, pred
def kpi_plot(pred):
    '''Prints the MSE and R2 score and plots the prediction vs the original'''
    print('ARIMA model MSE: {}'.format(mean_squared_error(test, pred)))
    print('R2 Score: {}'.format(r2_score(test, pred)))
    pred.plot(label='predicted')
    test.plot(label='test')
    plt.fill_between(pred.index, pred, test, color='k', alpha=.2)
    plt.legend()
As described above, I'll start with q=1 as a first approach.
result, pred = fit_arima((0,1,1), (0,1,1,52))
kpi_plot(pred)
result, pred = fit_arima((3,1,1), (3,1,1,52))
kpi_plot(pred)
result, pred = fit_arima((3,1,2), (3,1,2,52))
kpi_plot(pred)
The ARIMA (3,1,2) or (3,1,1) gives the best prediction for the test data set, although it still shows big differences in some months. In this run, (3,1,1) was slightly better.
I will now try to build an LSTM model for prediction. For this I will scale the input and divide it into three sets: train, validation and test data.
# the values have to be normalized for LSTM
values = star_day['diesel'].values.reshape(-1,1)
scaler = MinMaxScaler(feature_range=(0,1))
scaled = scaler.fit_transform(values)
# use the same data for training up to 2018
train_size = len(star_day[:'2018-12-31'])
vali_size = 31 # let's take 1 month as validation set for fitting
test_size = len(scaled) - train_size - vali_size
train, vali, test = scaled[:train_size,:], scaled[train_size:train_size+vali_size,:], scaled[train_size+vali_size:, :]
print(len(train), len(vali), len(test))
def create_data(dataset, look_back=1):
    '''Creates two arrays X and Y out of the given array.
    Input: array of data, steps to look back
    Output: X, Y
    '''
    dataX, dataY = [], []
    for i in range(len(dataset) - look_back):
        a = dataset[i:(i+look_back), 0]
        dataX.append(a)
        dataY.append(dataset[i+look_back, 0])
    print(len(dataY))
    return np.array(dataX), np.array(dataY)
look_back = 2
trainX, trainY = create_data(train, look_back)
valiX, valiY = create_data(vali, look_back)
testX, testY = create_data(test, look_back)
# reshape to make it usable as input for LSTM
trainX = np.reshape(trainX, (trainX.shape[0], 1, trainX.shape[1]))
valiX = np.reshape(valiX, (valiX.shape[0], 1, valiX.shape[1]))
testX = np.reshape(testX, (testX.shape[0], 1, testX.shape[1]))
# build a LSTM model
model = Sequential()
model.add(LSTM(50, input_shape=(trainX.shape[1], trainX.shape[2]), return_sequences=True))
model.add(Dropout(0.1))
model.add(LSTM(50))
model.add(Dense(1))
model.compile(loss='mae', optimizer='adam')
history = model.fit(trainX, trainY, epochs=300, batch_size=100, validation_data=(valiX, valiY), verbose=0, shuffle=False)
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.legend();
history.params
# make predictions
train_pre = model.predict(trainX)
vali_pre = model.predict(valiX)
test_pre = model.predict(testX)
# invert predictions
train_pre = scaler.inverse_transform(train_pre)
trainY = scaler.inverse_transform([trainY])
vali_pre = scaler.inverse_transform(vali_pre)
valiY = scaler.inverse_transform([valiY])
test_pre = scaler.inverse_transform(test_pre)
testY = scaler.inverse_transform([testY])
# calculate mean squared error and R2 score
trainScore = mean_squared_error(trainY[0], train_pre[:,0])
print('Train Score MSE {}'.format(trainScore))
valiScore = mean_squared_error(valiY[0], vali_pre[:,0])
print('Validation Score MSE {}'.format(valiScore))
testScore = mean_squared_error(testY[0], test_pre[:,0])
print('Test Score MSE {}'.format(testScore))
print('Train R2 Score: {}'.format(r2_score(trainY[0],train_pre[:,0])))
print('Validation R2 Score: {}'.format(r2_score(valiY[0],vali_pre[:,0])))
print('Test R2 Score: {}'.format(r2_score(testY[0],test_pre[:,0])))
# shift train predictions for plotting
trainPredictPlot = np.empty_like(scaled)
trainPredictPlot[:, :] = np.nan
trainPredictPlot[look_back:len(train_pre)+look_back, :] = train_pre
# shift validation predictions for plotting
valiPredictPlot = np.empty_like(scaled)
valiPredictPlot[:, :] = np.nan
valiPredictPlot[len(train_pre)+(look_back*2)-2:len(train_pre)+vali_size, :] = vali_pre
# shift test predictions for plotting
testPredictPlot = np.empty_like(scaled)
testPredictPlot[:, :] = np.nan
testPredictPlot[len(train_pre)+vali_size+(look_back*2)-2:len(scaled)-2, :] = test_pre
# plot baseline and predictions
plt.plot(scaler.inverse_transform(scaled))
plt.plot(trainPredictPlot)
plt.plot(valiPredictPlot)
plt.plot(testPredictPlot)
plt.legend(['original', 'train', 'validation', 'test'])
plt.show()
# make a dataframe of the predictions and set the index; the test data starts vali_size + look_back days into 2019, in our case on the 3rd of February 2019
test_pre_pd = pd.DataFrame(test_pre)
test_pre_pd.index = star_day['2019-02-03':'2019-12-20'].index
star_day['2019-02-03':'2019-12-20'].index
test_pre_pd.index
# I have to convert the index back to a DatetimeIndex, because otherwise I get an error when plotting
star_day.index = pd.DatetimeIndex(star_day.index.to_timestamp())
test_pre_pd.index = pd.DatetimeIndex(test_pre_pd.index.to_timestamp())
plt.plot(star_day['2019-02-03':'2019-12-20'], label='True')
plt.plot(test_pre_pd, label='Predicted')
plt.legend();
At first view the prediction of the test data (on the right side, in green) seems to be very similar to the original data. We will have a closer look.
# let's have a closer look at the test data
plt.plot(star_day['2019-12-01':'2019-12-20'], label='True')
plt.plot(test_pre_pd['2019-12-01':'2019-12-20'], label='Predicted')
plt.legend();
The plots show that the model fits very well even on the test data, but it is more of a "follower" than a prediction.
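A quick way to quantify this "follower" impression is to compare the LSTM error against a naive persistence baseline that predicts each day's price as the previous day's price. If the LSTM MSE is not clearly below this baseline, the model has mostly learned to copy the last value. A minimal sketch (applying it to the 2019 test slice as in the comment is my suggestion, not part of the original analysis):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def persistence_mse(series):
    '''MSE of the naive baseline that predicts each value by its predecessor.'''
    values = np.asarray(series, dtype=float)
    return mean_squared_error(values[1:], values[:-1])

# compare e.g. persistence_mse(star_day['2019-02-03':'2019-12-20'].diesel) with the LSTM test MSE
```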
The LSTM prediction above is based on the real test data, but I want to see whether the prediction also works as a sequence, i.e. using real start values and then feeding the predictions back in as new input data. I found an example from [Raval, S.](https://github.com/llSourcell/How-to-Predict-Stock-Prices-Easily-Demo/blob/master/lstm.py) and adapted it to my case.
# code adapted from https://github.com/llSourcell/How-to-Predict-Stock-Prices-Easily-Demo/blob/master/lstm.py
def predict_sequence_full(model, data, window_size):
    print('Predicting sequence...')
    curr_frame = data[0]
    predicted = []
    for i in range(len(data)):
        # predict one step
        predicted.append(model.predict(curr_frame[np.newaxis, :, :])[0, 0])
        # append the prediction at the end
        curr_frame = np.insert(curr_frame[0], [window_size], predicted[-1], axis=0)
        # drop the first value, so the last prediction becomes part of the new input
        curr_frame = curr_frame[np.newaxis, 1:]
    return predicted
# let's predict the test sequence, but using only the first test sample as a real input
full_predict = predict_sequence_full(model, testX, 2)
plt.plot(scaler.inverse_transform(np.array(full_predict).reshape(-1,1)));
Here we can see that a prediction which starts only with the first test sample as input and then forecasts more than one day ahead is not possible with this model. It converges to a limit and does not go up and down like the original data.